Udacity Project 4 EDA by STEVEN_ROSENFIELD

Data Description

This dataset contains 1599 instances of red wine of the Portuguese “Vinho Verde” variety. For each instance, there are 11 variables that describe the chemical properties of the wine, and 1 rating that corresponds to the quality of the wine. The quality of the wine is based on the average of 3 wine experts’ ratings, from 0 (very bad) to 10 (very excellent).

Univariate Plots Section

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
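The structure and summary listings above are the kind of output that str() and summary() print for a data frame. Here is a minimal sketch, assuming the data is loaded with read.csv. The inline CSV holds only the first five rows of four columns copied from the listing above so that the snippet runs on its own. The full analysis would read the complete file, whose name this report does not give.

```r
# First five rows of a few columns, copied from the str() listing above.
# Assumption: the full analysis reads the complete CSV from disk instead.
csv_text <- "X,volatile.acidity,sulphates,alcohol,quality
1,0.7,0.56,9.4,5
2,0.88,0.68,9.8,5
3,0.76,0.65,9.8,5
4,0.28,0.58,9.8,6
5,0.7,0.56,9.4,5"

wine <- read.csv(text = csv_text)

# Column types and a preview of the values
str(wine)

# Min, quartiles, mean, and max for every column
summary(wine)
```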

Some observations of these box plots: 1 - The variables with relatively high spread include citric acid, alcohol, and quality. 2 - The variables with relatively low spread include residual sugar, chlorides, and sulphates. 3 - The variables with approximately normal distributions include citric acid, density, pH, alcohol, and quality.

These boxplots are not detailed enough to depict the distribution of each feature, so let’s examine the dependent variable, quality, more closely with a histogram.

This plot shows a histogram of quality. It’s interesting that there were no values less than 3, or greater than 8. It also seems like most of the values for quality were either 5 or 6.

## [1] 0.8248906

About 82.5% of all observations had quality = 5 or 6. This confirms my observation from the histogram.
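The proportion can be reproduced from the per-quality counts that appear in the grouped summary table in the bivariate section. This is a sketch only: it rebuilds the quality column from those counts, and the variable names are illustrative rather than taken from the original code.

```r
# Number of wines at each quality score, copied from the grouped summary
# table in this report (the n column).
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Rebuild a quality vector with one entry per wine
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)

# Share of wines rated 5 or 6: the mean of a logical vector is a proportion
prop_5_or_6 <- mean(quality %in% c(5, 6))
print(prop_5_or_6)
```

On the full data frame the same idea would be mean(wine$quality %in% c(5, 6)).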

My initial theory is alcohol content will have an effect on quality, so I’m going to break down the alcohol variable by quartile, and store this data in a new variable called alcohol.quartile with values of ‘low’, ‘mid-low’, ‘mid-high’, and ‘high’ depending on the quartile the alcohol content falls into.

## 
##     high      low mid-high  mid-low 
##      407      297      396      499
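One way to build such a quartile label is cut(). The sketch below is not the original code. It assumes break points taken from the alcohol quartiles in the summary output (8.4, 9.5, 10.2, 11.1, 14.9) and assumes each interval is closed on the right, so the counts on the full data may differ from the table above if boundary values were handled differently. The ten alcohol values are the first ten rows shown in the str() listing.

```r
# First ten alcohol values from the str() listing
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)

# Min, quartiles, and max of alcohol from the summary output.
# Assumption: intervals are closed on the right, e.g. (8.4, 9.5].
breaks <- c(8.4, 9.5, 10.2, 11.1, 14.9)
labels <- c("low", "mid-low", "mid-high", "high")

# An ordered factor keeps the levels in low-to-high order for plotting
alcohol.quartile <- cut(alcohol, breaks = breaks, labels = labels,
                        include.lowest = TRUE, ordered_result = TRUE)

table(alcohol.quartile)
```

Because the result is an ordered factor, table() and plots list the levels from low to high instead of alphabetically.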

Now I will break down the volatile acidity into halves (above median vs. below median), label each row with ‘low’ or ‘high’, and store this data in a new variable, vol.acid.half. I will also create a factor variable for the volatile acidity half to order it correctly for plotting.

## 
## high  low 
##  816  783
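A median split like this can be sketched with ifelse(). This sketch assumes that values equal to the median are labelled ‘high’, which would be consistent with the ‘high’ group being slightly larger in the table above, but the original code is not shown. The values are the first ten rows from the str() listing.

```r
# First ten volatile acidity values from the str() listing
volatile.acidity <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)

# Assumption: values equal to the median go into the 'high' half
cutoff <- median(volatile.acidity)
vol.acid.half <- ifelse(volatile.acidity >= cutoff, "high", "low")

# Factor with explicit level order so 'low' is plotted before 'high'
vol.acid.half <- factor(vol.acid.half, levels = c("low", "high"))

table(vol.acid.half)
```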

Now I will create a factor variable for alcohol quartile to order it correctly for plotting.

It looks like the buckets are close to evenly distributed.

Let’s break the sulphates into buckets as well by rounding each value down to the nearest tenth and storing the result in a variable called sulphates_bucket.

## 
## 0.3 0.4 0.5 0.6 0.7 0.8 0.9   1 1.1 1.2 1.3 1.5 1.6 1.9   2 
##   9 142 503 446 251 138  51  22  18   5   6   2   2   3   1

Few wines have sulphates >= 1, so let’s group them all into one bucket. Also, let’s group the .3’s in with the .4’s because there are only 9 wines with a value of .3.

## 
## 0.4 0.5 0.6 0.7 0.8 0.9   1 
## 151 503 446 251 138  51  59
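The two bucketing steps above (round down to the nearest tenth, then merge the sparse ends) can be sketched as follows. This is an illustration under assumptions, not the original code: it assumes the data has two decimal places and rounds to whole hundredths first so that floating-point error cannot push a value into the wrong bucket. The sample is the first ten sulphates values from the str() listing plus the minimum (0.33) and maximum (2.0) from the summary output.

```r
# First ten sulphates values, plus the reported minimum and maximum
sulphates <- c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8,
               0.33, 2.0)

# Round down to the nearest tenth. Working in whole hundredths first
# avoids surprises such as a product landing just below an integer.
sulphates_bucket <- floor(round(sulphates * 100) / 10) / 10

# Merge the sparse ends: everything below 0.4 joins the 0.4 bucket,
# everything at or above 1 joins the 1 bucket.
sulphates_bucket <- pmin(pmax(sulphates_bucket, 0.4), 1)

table(sulphates_bucket)
```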

This is the number of occurrences for each sulphates bucket. It looks like cleaner data to work with!

Let’s create a histogram to confirm.

The sulphates buckets look approximately normally distributed.

Univariate Analysis

What is the structure of your dataset?

The dataset includes 1599 observations of 13 variables. Eleven of the variables are of type num (numeric); the other two, X and quality, are integers.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in my dataset is the dependent variable quality.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

As of now, any of the 11 independent variables can help support my investigation into my feature of interest. I will have to explore them more in the bivariate section.

Did you create any new variables from existing variables in the dataset?

Yes, I created the alcohol.quartile variable based on the quartile that the alcohol variable falls into for each observation.

I also created wine$alc.qua.fac to organize alcohol.quartile in order.

I also created a new variable to classify volatile acidity into buckets.

I also created a new variable to classify sulphates into buckets.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Both residual sugar and chlorides had very long tails skewed to the right.

Sulphates also had a relatively long tail skewed right, so when I grouped sulphates into buckets with a new variable, I grouped the outliers into buckets that will make the data easier to visualize going forward.

Bivariate Plots Section

Let’s create a box plot of alcohol vs. quality

Based on these box plots, it looks like alcohol is positively correlated with quality. Let’s investigate further with a scatterplot.

This scatter-plot shows a positive correlation between alcohol content and quality, affirming my suspicion.

Let’s see how the rest of the variables correlate with quality.

Based on these plots, it looks like alcohol, sulphates, and citric acid have relatively large positive correlation with quality. Also, it looks like volatile acidity, chlorides, and total sulfur dioxide have relatively large negative correlation with quality.

Let’s calculate the Pearson correlation coefficient between all numeric variables

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity           1.00000000     -0.256130895  0.67170343
## volatile.acidity       -0.25613089      1.000000000 -0.55249568
## citric.acid             0.67170343     -0.552495685  1.00000000
## residual.sugar          0.11477672      0.001917882  0.14357716
## chlorides               0.09370519      0.061297772  0.20382291
## free.sulfur.dioxide    -0.15379419     -0.010503827 -0.06097813
## total.sulfur.dioxide   -0.11318144      0.076470005  0.03553302
## density                 0.66804729      0.022026232  0.36494718
## pH                     -0.68297819      0.234937294 -0.54190414
## sulphates               0.18300566     -0.260986685  0.31277004
## alcohol                -0.06166827     -0.202288027  0.10990325
## quality                 0.12405165     -0.390557780  0.22637251
##                      residual.sugar    chlorides free.sulfur.dioxide
## fixed.acidity           0.114776724  0.093705186        -0.153794193
## volatile.acidity        0.001917882  0.061297772        -0.010503827
## citric.acid             0.143577162  0.203822914        -0.060978129
## residual.sugar          1.000000000  0.055609535         0.187048995
## chlorides               0.055609535  1.000000000         0.005562147
## free.sulfur.dioxide     0.187048995  0.005562147         1.000000000
## total.sulfur.dioxide    0.203027882  0.047400468         0.667666450
## density                 0.355283371  0.200632327        -0.021945831
## pH                     -0.085652422 -0.265026131         0.070377499
## sulphates               0.005527121  0.371260481         0.051657572
## alcohol                 0.042075437 -0.221140545        -0.069408354
## quality                 0.013731637 -0.128906560        -0.050656057
##                      total.sulfur.dioxide     density          pH
## fixed.acidity                 -0.11318144  0.66804729 -0.68297819
## volatile.acidity               0.07647000  0.02202623  0.23493729
## citric.acid                    0.03553302  0.36494718 -0.54190414
## residual.sugar                 0.20302788  0.35528337 -0.08565242
## chlorides                      0.04740047  0.20063233 -0.26502613
## free.sulfur.dioxide            0.66766645 -0.02194583  0.07037750
## total.sulfur.dioxide           1.00000000  0.07126948 -0.06649456
## density                        0.07126948  1.00000000 -0.34169933
## pH                            -0.06649456 -0.34169933  1.00000000
## sulphates                      0.04294684  0.14850641 -0.19664760
## alcohol                       -0.20565394 -0.49617977  0.20563251
## quality                       -0.18510029 -0.17491923 -0.05773139
##                         sulphates     alcohol     quality
## fixed.acidity         0.183005664 -0.06166827  0.12405165
## volatile.acidity     -0.260986685 -0.20228803 -0.39055778
## citric.acid           0.312770044  0.10990325  0.22637251
## residual.sugar        0.005527121  0.04207544  0.01373164
## chlorides             0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide   0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide  0.042946836 -0.20565394 -0.18510029
## density               0.148506412 -0.49617977 -0.17491923
## pH                   -0.196647602  0.20563251 -0.05773139
## sulphates             1.000000000  0.09359475  0.25139708
## alcohol               0.093594750  1.00000000  0.47616632
## quality               0.251397079  0.47616632  1.00000000
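A matrix like the one above is what cor() returns for a data frame of numeric columns. The sketch below uses only the first ten rows of four columns from the str() listing, so its coefficients will not match the full-data values above. It is meant to show the call, not to reproduce the numbers.

```r
# First ten rows of four columns, copied from the str() listing
wine_sample <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pairwise Pearson correlation between every pair of columns
cor_matrix <- cor(wine_sample, method = "pearson")
print(round(cor_matrix, 3))
```

The result is symmetric with ones on the diagonal, which is why each coefficient appears twice in the listing above.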

Based on these correlation coefficients, alcohol has the highest correlation with quality, with a Pearson correlation coefficient of R = .476. The next highest is sulphates with R = .251, and then citric acid with R = .226.

The most negative correlation coefficients are volatile acidity with R = -.391, total sulfur dioxide with R = -.185, and density with R = -.175.

This confirms what I saw in the earlier graphs.
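To read the ranking off without scanning the matrix by eye, the quality column can be sorted. The values below are copied from the quality column of the correlation matrix above. The vector name is illustrative.

```r
# Correlation of each variable with quality, copied from the matrix above
quality_cor <- c(
  fixed.acidity        =  0.12405165,
  volatile.acidity     = -0.39055778,
  citric.acid          =  0.22637251,
  residual.sugar       =  0.01373164,
  chlorides            = -0.12890656,
  free.sulfur.dioxide  = -0.05065606,
  total.sulfur.dioxide = -0.18510029,
  density              = -0.17491923,
  pH                   = -0.05773139,
  sulphates            =  0.25139708,
  alcohol              =  0.47616632
)

# Sort from most negative to most positive
ranked <- sort(quality_cor)

# Three most negative and three most positive correlations with quality
print(head(ranked, 3))
print(tail(ranked, 3))
```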

The strongest positive relationship is fixed acidity to citric acid with R = .672, and the most negative relationship is fixed acidity to pH with R = -.683, which is also the largest in magnitude. These variables would logically have strong relationships because they all describe acidity.

The highest R value for a relationship that stuck out to me was density to fixed acidity with R = .668. Let’s explore that more.

High correlation in this graph!

Let’s look at alcohol vs. volatile acidity because it looks like these two variables have the strongest correlations to quality

There’s a slight negative correlation, which makes sense because they had a correlation coefficient = -.202

Now I want to explore volatile acidity’s correlation with quality, because I noted that it has the strongest negative correlation.

This shows as the quality increases, the volatile acidity decreases.

Now I want to explore sulphates’ correlation with quality, because I noted that it has a relatively high positive correlation.

The slight positive correlation appears in this chart.

Let’s group the data by quality score and analyze it that way.

## # A tibble: 6 × 4
##   quality  alc_mean alc_median     n
##     <int>     <dbl>      <dbl> <int>
## 1       3  9.955000      9.925    10
## 2       4 10.265094     10.000    53
## 3       5  9.899706      9.700   681
## 4       6 10.629519     10.500   638
## 5       7 11.465913     11.500   199
## 6       8 12.094444     12.150    18
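The tibble above has the shape of a dplyr group_by() and summarise() result, so the sketch below assumes dplyr was used. It runs on the first ten rows from the str() listing only, so it yields three quality groups rather than six, and its means differ from the full-data table.

```r
library(dplyr)

# First ten rows of quality and alcohol from the str() listing
wine_sample <- data.frame(
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# One row per quality score: mean and median alcohol, and group size
alcohol_by_quality <- wine_sample %>%
  group_by(quality) %>%
  summarise(alc_mean = mean(alcohol),
            alc_median = median(alcohol),
            n = n())

print(alcohol_by_quality)
```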

Let’s perform more analysis on other variables:

Not much correlation here.

Slight negative correlation between quality and density

Slight negative correlation here.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

Based on these correlation coefficients, alcohol has the highest correlation with quality (Pearson correlation coefficient R = .476). The next highest is sulphates with R = .251, and then citric acid with R = .226.

The most negative correlation coefficients with quality are volatile acidity with R = -.391, total sulfur dioxide with R = -.185, and density with R = -.175.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

The strongest positive relationship out of all the data is fixed acidity to citric acid with R = .672, and the most negative relationship is fixed acidity to pH with R = -.683, which is the largest in magnitude overall. These variables would logically have strong relationships because they all describe acidity.

The highest R value for a relationship that stuck out to me was density to fixed acidity with R = .668.

What was the strongest relationship you found?

As mentioned earlier, the strongest positive relationship I saw was fixed acidity to citric acid with R = .672 (fixed acidity to pH was slightly stronger in magnitude, with R = -.683). However, this makes sense because they both describe acidity.

The highest R value for two variables that didn’t obviously describe the same thing was density to fixed acidity with R = .668.

Multivariate Plots Section

Let’s look at the two variables most strongly correlated with quality, alcohol and volatile acidity, in one graph.

This chart isn’t very useful because it’s too busy. Let’s try a scatterplot.

It looks like what I expected. The higher quality wine is more purple (high alcohol content), and falls towards the left on the graph (low volatile acidity).

Now let’s try setting alcohol as the x-axis and volatile acidity as the categorical variable.

This is a good looking chart! It clearly shows as alcohol content increases quality increases, and lower volatile acidity corresponds to higher quality.

Let’s look at alcohol vs. volatile acidity broken into sulphate groups. These are the 3 most correlated variables to quality

This graph doesn’t seem to imply much correlation between these variables at all.

Let’s look at sulphates bucket vs. alcohol colored by quality.

Again this isn’t a very useful chart, because it doesn’t show much correlation.

Let’s look at quality against sulphates, grouped by alcohol bucket

This chart is a little too busy to gather much insight.

Let’s look at alcohol vs. volatile acidity, grouped by quality

This plot shows that as volatile acidity is lower, the quality is higher. It also shows that as alcohol is higher, the quality is higher. Finally, it shows that alcohol has a very weak negative correlation with volatile acidity.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

I observed that alcohol content corresponds to increased quality, whereas volatile acidity content corresponds to decreased quality.

Sulphates seemed arbitrary with respect to quality.

Were there any interesting or surprising interactions between features?

The most interesting interaction between features is that alcohol content corresponds to increased quality, whereas volatile acidity content corresponds to decreased quality.

Other than that, it wasn’t too interesting because there weren’t many attributes that had a large effect on quality.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

I did not create any models with my dataset.


Final Plots and Summary

Plot One

Description One

I chose this plot because it shows how each of the attributes can affect the variable of interest, quality. It helped me gather insight into the data to know which attributes I should explore more closely, and which attributes I can probably disregard because they have a loose connection to quality.

Plot Two

Description Two

This plot shows how the quality of wine varies with alcohol content. To make this plot, I first plotted a scatter plot of quality vs. alcohol, and then I plotted a boxplot for each value of quality vs. alcohol on top of the scatter plot. Finally, I showed the mean value of alcohol content for each quality group with a red dot. This plot shows a moderate positive correlation between quality and alcohol (R = .476), and alcohol content appears to be more strongly associated with the quality score than any other attribute.
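The layering described here (points, then a boxplot per quality score, then a red dot for each group mean) can be sketched with ggplot2. This is an assumption about the plotting library and styling, since the original plot code is not shown. The jitter width, transparency, and axis labels are illustrative choices, and the data is the first ten rows from the str() listing.

```r
library(ggplot2)

# First ten rows of quality and alcohol from the str() listing
wine_sample <- data.frame(
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

p <- ggplot(wine_sample, aes(x = factor(quality), y = alcohol)) +
  # Layer 1: individual wines, jittered so overlapping points stay visible
  geom_jitter(width = 0.2, alpha = 0.4) +
  # Layer 2: one boxplot per quality score, drawn over the points
  geom_boxplot(alpha = 0.5, outlier.shape = NA) +
  # Layer 3: mean alcohol content of each quality group as a red dot
  stat_summary(fun = mean, geom = "point", color = "red", size = 3) +
  labs(x = "Quality (score)", y = "Alcohol (% by volume)")

print(p)
```

Treating quality as a factor on the x-axis is what gives one box per score. Left as an integer, ggplot2 would draw a single box for the whole column.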

Plot Three

Description Three

I chose this plot because it shows the two attributes with the strongest correlation to quality. As Alcohol % by volume increases, the quality increases. On the other hand, wines with high Volatile Acidity are more likely to be a lower quality wine than wines with low Volatile Acidity.


Reflection

This project taught me a lot about how to use R to perform Exploratory Data Analysis. I chose this dataset because I like to drink wine, and I was looking forward to seeing if there were any scientific reasons as to why one bottle of wine was better than another bottle of wine. I was able to gather some insight into what makes experts consider a bottle of wine good. However, the dataset did not allow me to gain as much insight as I would’ve liked for several reasons.

If these 3 experts’ ratings are to be trusted, there were many insights I found into wine quality. First of all, I learned that higher alcohol content often contributes to better quality wine. Also, I learned that high volatile acidity contributes to lower quality wine. This makes sense because this is “the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste.” Out of the other 9 variables, I couldn’t gather too much insight into what constitutes good quality wine, because they had low correlations to the quality variable.

Unfortunately, the dataset I chose didn’t allow me to make too many interesting insights or visualizations for several reasons. First of all, the quality values were only integers. This was very frustrating because if this value took fractions, I could’ve gathered more accurate insights into how other variables contributed to it. Also, the average of the 3 experts’ ratings naturally should’ve been decimals, so it was confusing why the data only had integers.

Another reason the data wasn’t great was that there wasn’t much correlation between the variables. For example, no variable in the dataset had a correlation coefficient higher than .5 with quality. If there were stronger correlations, the graphs would’ve been clearer and more insightful for my audience.

If future work was going to be done with this dataset, I would suggest collecting data from more than 3 experts, and then storing the quality rating as a decimal in order to get more granular insight into how other variables contributed to the rating.

Overall, I got valuable experience working with R and performing Exploratory Data Analysis, but I wish I had chosen a better dataset to work with for this project.

Thanks for reading!

Resources

www.stackoverflow.com

https://s3.amazonaws.com/content.udacity-data.com/courses/ud651/diamondsExample_2016-05.html